This RMarkdown file contains the report of the data analysis done for the project on forecasting daily bike rental demand using time series models in R. It covers data exploration, summary statistics, and the building of the time series models.
Data Description:
The dataset contains the daily count of rental bike transactions between the years 2011 and 2012 in the Capital Bikeshare system, with the corresponding weather and seasonal information.
The dataset has the following fields:
- instant: record index
- dteday : date
- season : season (1:winter, 2:spring, 3:summer, 4:fall)
- yr : year (0: 2011, 1:2012)
- mnth : month ( 1 to 12)
- holiday : whether the day is a holiday or not (extracted from http://dchr.dc.gov/page/holiday-schedule)
- weekday : day of the week
- workingday : 1 if the day is neither a weekend nor a holiday, otherwise 0
- weathersit :
- 1: Clear, Few clouds, Partly cloudy
- 2: Mist + Cloudy, Mist + Broken clouds, Mist + Few clouds, Mist
- 3: Light Snow, Light Rain + Thunderstorm + Scattered clouds, Light Rain + Scattered clouds
- 4: Heavy Rain + Ice Pallets + Thunderstorm + Mist, Snow + Fog
- temp : Normalized temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-8, t_max=+39 (only in hourly scale)
- atemp: Normalized feeling temperature in Celsius. The values are derived via (t-t_min)/(t_max-t_min), t_min=-16, t_max=+50 (only in hourly scale)
- hum: Normalized humidity. The values are divided by 100 (max)
- windspeed: Normalized wind speed. The values are divided by 67 (max)
- casual: count of casual users
- registered: count of registered users
- cnt: count of total rental bikes including both casual and registered
Data Source: https://archive.ics.uci.edu/ml/datasets/bike+sharing+dataset
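As a quick sanity check on these encodings, the integer codes can be mapped to readable labels right after loading, and the normalized temperature can be converted back to Celsius using the formula above (a minimal sketch; it assumes `day.csv` is in the working directory, and the `_lbl`/`temp_c` column names are illustrative):

```r
# Read the daily data and decode a few of the integer-coded fields.
dfd <- read.csv("day.csv")

# Map the season and year codes to labels per the data description.
dfd$season_lbl <- factor(dfd$season, levels = 1:4,
                         labels = c("winter", "spring", "summer", "fall"))
dfd$yr_lbl <- factor(dfd$yr, levels = 0:1, labels = c("2011", "2012"))

# Invert the normalization t_norm = (t - t_min)/(t_max - t_min),
# with t_min = -8 and t_max = +39, to recover degrees Celsius.
dfd$temp_c <- dfd$temp * (39 - (-8)) + (-8)

table(dfd$season_lbl)  # count of days per season
```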
## Import required packages
install.packages("timetk")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("lubridate")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
install.packages("dplyr")
## Installing package into '/cloud/lib/x86_64-pc-linux-gnu-library/4.4'
## (as 'lib' is unspecified)
library(timetk)
library(lubridate)
##
## Attaching package: 'lubridate'
## The following objects are masked from 'package:base':
##
## date, intersect, setdiff, union
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(forecast)
## Registered S3 method overwritten by 'quantmod':
## method from
## as.zoo.data.frame zoo
library(TTR)
library(tseries) # For adf.test()
# Read the CSV files
dfd <- read.csv("day.csv")
print("Structure of dfd")
## [1] "Structure of dfd"
str(dfd)
## 'data.frame': 731 obs. of 16 variables:
## $ instant : int 1 2 3 4 5 6 7 8 9 10 ...
## $ dteday : chr "2011-01-01" "2011-01-02" "2011-01-03" "2011-01-04" ...
## $ season : int 1 1 1 1 1 1 1 1 1 1 ...
## $ yr : int 0 0 0 0 0 0 0 0 0 0 ...
## $ mnth : int 1 1 1 1 1 1 1 1 1 1 ...
## $ holiday : int 0 0 0 0 0 0 0 0 0 0 ...
## $ weekday : int 6 0 1 2 3 4 5 6 0 1 ...
## $ workingday: int 0 0 1 1 1 1 1 0 0 1 ...
## $ weathersit: int 2 2 1 1 1 1 2 2 1 1 ...
## $ temp : num 0.344 0.363 0.196 0.2 0.227 ...
## $ atemp : num 0.364 0.354 0.189 0.212 0.229 ...
## $ hum : num 0.806 0.696 0.437 0.59 0.437 ...
## $ windspeed : num 0.16 0.249 0.248 0.16 0.187 ...
## $ casual : int 331 131 120 108 82 88 148 68 54 41 ...
## $ registered: int 654 670 1229 1454 1518 1518 1362 891 768 1280 ...
## $ cnt : int 985 801 1349 1562 1600 1606 1510 959 822 1321 ...
print("Summary of dfd")
## [1] "Summary of dfd"
summary(dfd)
## instant dteday season yr
## Min. : 1.0 Length:731 Min. :1.000 Min. :0.0000
## 1st Qu.:183.5 Class :character 1st Qu.:2.000 1st Qu.:0.0000
## Median :366.0 Mode :character Median :3.000 Median :1.0000
## Mean :366.0 Mean :2.497 Mean :0.5007
## 3rd Qu.:548.5 3rd Qu.:3.000 3rd Qu.:1.0000
## Max. :731.0 Max. :4.000 Max. :1.0000
## mnth holiday weekday workingday
## Min. : 1.00 Min. :0.00000 Min. :0.000 Min. :0.000
## 1st Qu.: 4.00 1st Qu.:0.00000 1st Qu.:1.000 1st Qu.:0.000
## Median : 7.00 Median :0.00000 Median :3.000 Median :1.000
## Mean : 6.52 Mean :0.02873 Mean :2.997 Mean :0.684
## 3rd Qu.:10.00 3rd Qu.:0.00000 3rd Qu.:5.000 3rd Qu.:1.000
## Max. :12.00 Max. :1.00000 Max. :6.000 Max. :1.000
## weathersit temp atemp hum
## Min. :1.000 Min. :0.05913 Min. :0.07907 Min. :0.0000
## 1st Qu.:1.000 1st Qu.:0.33708 1st Qu.:0.33784 1st Qu.:0.5200
## Median :1.000 Median :0.49833 Median :0.48673 Median :0.6267
## Mean :1.395 Mean :0.49538 Mean :0.47435 Mean :0.6279
## 3rd Qu.:2.000 3rd Qu.:0.65542 3rd Qu.:0.60860 3rd Qu.:0.7302
## Max. :3.000 Max. :0.86167 Max. :0.84090 Max. :0.9725
## windspeed casual registered cnt
## Min. :0.02239 Min. : 2.0 Min. : 20 Min. : 22
## 1st Qu.:0.13495 1st Qu.: 315.5 1st Qu.:2497 1st Qu.:3152
## Median :0.18097 Median : 713.0 Median :3662 Median :4548
## Mean :0.19049 Mean : 848.2 Mean :3656 Mean :4504
## 3rd Qu.:0.23321 3rd Qu.:1096.0 3rd Qu.:4776 3rd Qu.:5956
## Max. :0.50746 Max. :3410.0 Max. :6946 Max. :8714
Correlation between the normalized temperature (temp), the normalized feeling temperature (atemp), and the total count of bike rentals:
cor(dfd$temp, dfd$cnt)
## [1] 0.627494
cor(dfd$atemp, dfd$cnt)
## [1] 0.6310657
Mean and median temperatures for the different seasons (winter, spring, summer, and fall):
aggregate(temp ~ season, data = dfd, FUN = function(x) c(mean = mean(x), median = median(x)))
## season temp.mean temp.median
## 1 1 0.2977475 0.2858330
## 2 2 0.5444052 0.5620835
## 3 3 0.7063093 0.7145830
## 4 4 0.4229060 0.4091665
The mean temperature, humidity, wind speed, and total rentals per month
library(dplyr)
dfd$dteday <- as.Date(dfd$dteday)
monthly_summary <- dfd %>%
mutate(month = format(dteday, "%Y-%m")) %>%
group_by(month) %>%
summarise(mean_temp = mean(temp, na.rm = TRUE),
mean_humidity = mean(hum, na.rm = TRUE),
mean_wind_speed = mean(windspeed, na.rm = TRUE),
total_rentals = sum(cnt, na.rm = TRUE))
print(monthly_summary)
## # A tibble: 24 × 5
## month mean_temp mean_humidity mean_wind_speed total_rentals
## <chr> <dbl> <dbl> <dbl> <int>
## 1 2011-01 0.198 0.584 0.195 38189
## 2 2011-02 0.283 0.560 0.229 48215
## 3 2011-03 0.332 0.569 0.232 64045
## 4 2011-04 0.471 0.668 0.244 94870
## 5 2011-05 0.577 0.713 0.181 135821
## 6 2011-06 0.693 0.593 0.178 143512
## 7 2011-07 0.759 0.590 0.172 141341
## 8 2011-08 0.705 0.627 0.191 136691
## 9 2011-09 0.613 0.784 0.153 127418
## 10 2011-10 0.470 0.707 0.176 123511
## # ℹ 14 more rows
Temperature associated with bike rentals (registered vs. casual)
cor(dfd$temp, dfd$registered)
## [1] 0.540012
cor(dfd$temp, dfd$casual)
## [1] 0.5432847
boxplot(temp ~ season, data = dfd,
main = "Temperature by Season",
xlab = "Season",
ylab = "Normalized Temperature",
col = c("blue", "green", "red", "yellow"))
library(ggplot2)
# Convert month to a proper date format for plotting
monthly_summary$month <- as.Date(paste0(monthly_summary$month, "-01"))
# Plotting
ggplot(monthly_summary, aes(x = month)) +
geom_line(aes(y = mean_temp, color = "Mean Temperature")) +
geom_line(aes(y = total_rentals/100000, color = "Total Rentals (scaled)")) + # Scaling rentals for visibility
labs(title = "Mean Temperature and Total Rentals per Month",
x = "Month",
y = "Value") +
scale_y_continuous(sec.axis = sec_axis(~ . * 100000, name = "Total Rentals")) +
theme_minimal()
dfd$date <- as.Date(dfd$dteday) # Ensure the date column is in Date format
dfd <- dfd %>%
mutate(year = year(dteday), month = month(dteday))
# Aggregate data by date to get total rentals per day
daily_rentals <- dfd %>%
group_by(dteday) %>%
summarise(total_rentals = sum(cnt, na.rm = TRUE))
# Verify correctness
head(daily_rentals) # Check the first few rows of the aggregated data
## # A tibble: 6 × 2
## dteday total_rentals
## <date> <int>
## 1 2011-01-01 985
## 2 2011-01-02 801
## 3 2011-01-03 1349
## 4 2011-01-04 1562
## 5 2011-01-05 1600
## 6 2011-01-06 1606
str(daily_rentals) # Check the structure of the data
## tibble [731 × 2] (S3: tbl_df/tbl/data.frame)
## $ dteday : Date[1:731], format: "2011-01-01" "2011-01-02" ...
## $ total_rentals: int [1:731] 985 801 1349 1562 1600 1606 1510 959 822 1321 ...
# Create an interactive time series plot
plot_time_series(
.data = daily_rentals,
.date_var = dteday,
.value = total_rentals,
.interactive = TRUE,
.plotly_slider = TRUE,
.title = "Total Daily Bike Rentals"
#.y_label = "Total Rentals"
)
# Seasonal diagnostics
plot_seasonal_diagnostics(
.data = daily_rentals,
.date_var = dteday,
.value = total_rentals
)
# Anomaly diagnostics
plot_anomaly_diagnostics(
.data = daily_rentals,
.date_var = dteday,
.value = total_rentals
)
## frequency = 7 observations per 1 week
## trend = 92 observations per 3 months
# Ensure the dteday column is of Date type and total_rentals is numeric
daily_rentals$dteday <- as.Date(daily_rentals$dteday)
daily_rentals$total_rentals <- as.numeric(daily_rentals$total_rentals)
# Step 1: Clean the time series data
# Create a time series object
ts_data <- ts(daily_rentals$total_rentals, frequency = 365, start = c(year(min(daily_rentals$dteday)), yday(min(daily_rentals$dteday))))
# Clean the time series data
cleaned_ts <- tsclean(ts_data)
# Step 2: Apply Simple Exponential Smoothing
ses_model <- HoltWinters(cleaned_ts, beta = FALSE, gamma = FALSE) # Simple Exponential Smoothing (level only)
# Step 3: Apply Simple Moving Average with order 10
sma_values <- SMA(cleaned_ts, n = 10)
# Step 4: Plotting the results
plot.ts(cleaned_ts, main = "Cleaned Time Series Data", ylab = "Total Rentals", xlab = "Time")
lines(ses_model$fitted[,1], col = "blue", lwd = 2, lty = 2) # Fitted values from SES
lines(sma_values, col = "red", lwd = 2) # SMA line
legend("topright", legend = c("Cleaned Data", "SES Fitted", "SMA"), col = c("black", "blue", "red"), lty = c(1, 2, 1), lwd = 2)
To decompose the time series data and assess its stationarity, we follow these steps:
Step 1: Decompose the Time Series
We use the decompose() or stl() functions to break the series down into
its components: trend, seasonal, and remainder.
Step 2: Assess Stationarity
To check whether the series is stationary, we inspect the ACF and PACF
plots and run the Augmented Dickey-Fuller (ADF) test with adf.test().
Step 3: Make the Time Series Stationary
If the series is not stationary, we apply differencing using the diff()
function.
# Create a time series object
ts_data <- ts(daily_rentals$total_rentals, frequency = 365, start = c(year(min(daily_rentals$dteday)), yday(min(daily_rentals$dteday))))
# Step 1: Decompose the time series
decomposed <- stl(ts_data, s.window = "periodic") # Use stl for seasonal decomposition
# Plot the decomposed components
plot(decomposed)
# Extract the seasonal component
seasonal_component <- decomposed$time.series[, "seasonal"]
# Create a stationary time series by removing the seasonal component
stationary_ts <- ts_data - seasonal_component
# Step 2: Assess stationarity
# Plot ACF and PACF
par(mfrow = c(1, 2))
acf(stationary_ts, main = "ACF of Stationary Series")
pacf(stationary_ts, main = "PACF of Stationary Series")
# Perform Augmented Dickey-Fuller test
adf_test_result <- adf.test(stationary_ts, alternative = "stationary")
print(adf_test_result)
##
## Augmented Dickey-Fuller Test
##
## data: stationary_ts
## Dickey-Fuller = -3.7756, Lag order = 9, p-value = 0.02022
## alternative hypothesis: stationary
# Step 3: Differencing if not stationary
# Check if ADF test p-value is greater than 0.05 (not stationary)
if (adf_test_result$p.value > 0.05) {
differenced_ts <- diff(stationary_ts)
# Plot differenced time series
plot(differenced_ts, main = "Differenced Time Series", ylab = "Differenced Rentals")
# Recheck stationarity
adf_test_result_diff <- adf.test(differenced_ts, alternative = "stationary")
print(adf_test_result_diff)
} else {
message("The time series is already stationary.")
}
## The time series is already stationary.
Explanation of Steps
Decompose the Time Series:
1. We used stl() to decompose the time series, which lets us view the
trend, seasonal, and remainder components.
2. Plotting the decomposition visualizes these components.
Assess Stationarity:
1. Plot the ACF and PACF to visually inspect the autocorrelation
structure of the series.
2. Use adf.test() to statistically assess stationarity. A p-value >
0.05 indicates that the series is not stationary.
Make the Time Series Stationary:
1. If the series is not stationary, apply the diff() function to
difference the series and remove trends. This often helps to stabilize
the mean of the time series.
2. After differencing, recheck stationarity using the ADF test.
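As a compact cross-check of the manual workflow above, the forecast package can estimate the required differencing order directly (a sketch, assuming ts_data has already been created as in the chunk above):

```r
library(forecast)

# Estimate how many regular differences are needed for stationarity,
# using the same ADF test as above. A result of 0 means the series
# already passes the unit-root test without differencing.
d <- ndiffs(ts_data, test = "adf")
print(d)

if (d > 0) {
  diffed_ts <- diff(ts_data, differences = d)
  plot(diffed_ts, main = "Differenced Series", ylab = "Differenced Rentals")
} else {
  message("No differencing required.")
}
```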
To fit and forecast the time series with ARIMA models, we follow these steps in R: fitting both manual and automatic ARIMA models, checking the residuals, and making forecasts.
Step 1: Fit ARIMA Models
We fit a manual ARIMA model using the arima() function and an automatic
model using auto.arima() from the forecast package.
Step 2: Check Residuals
After fitting the models, we check that the residuals are approximately
normally distributed and uncorrelated.
Step 3: Forecasting
We use the forecast() function to predict future values and compare the
results of the manual and automatic models.
# Create a time series object
ts_data <- ts(daily_rentals$total_rentals, frequency = 365, start = c(year(min(daily_rentals$dteday)), yday(min(daily_rentals$dteday))))
# Step 1: Fit Manual ARIMA Model
manual_arima_model <- arima(ts_data, order = c(1, 1, 1)) # Change (p,d,q) as necessary
summary(manual_arima_model)
##
## Call:
## arima(x = ts_data, order = c(1, 1, 1))
##
## Coefficients:
## ar1 ma1
## 0.3584 -0.8896
## s.e. 0.0423 0.0192
##
## sigma^2 estimated as 855419: log likelihood = -6021.96, aic = 12049.91
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 12.87842 924.2558 647.6408 -44.46123 58.52916 0.8873291 0.01117124
# Step 2: Fit Automatic ARIMA Model
auto_arima_model <- auto.arima(ts_data)
summary(auto_arima_model)
## Series: ts_data
## ARIMA(1,0,2)(0,1,0)[365] with drift
##
## Coefficients:
## ar1 ma1 ma2 drift
## 0.9586 -0.6363 -0.1892 5.7093
## s.e. 0.0283 0.0583 0.0506 0.7566
##
## sigma^2 = 1599566: log likelihood = -3131.76
## AIC=6273.52 AICc=6273.68 BIC=6293.03
##
## Training set error measures:
## ME RMSE MAE MPE MAPE MASE ACF1
## Training set 5.357072 890.0137 457.0405 -44.28372 51.73145 0.1967752 0.01047273
# Step 3: Check Residuals for manual ARIMA model
residuals_manual <- residuals(manual_arima_model)
shapiro_test_manual <- shapiro.test(residuals_manual)
print(shapiro_test_manual)
##
## Shapiro-Wilk normality test
##
## data: residuals_manual
## W = 0.90528, p-value < 2.2e-16
# Plot ACF and PACF of residuals
par(mfrow = c(1, 2))
acf(residuals_manual, main = "ACF of Manual ARIMA Residuals")
pacf(residuals_manual, main = "PACF of Manual ARIMA Residuals")
# Check residuals for automatic ARIMA model
residuals_auto <- residuals(auto_arima_model)
shapiro_test_auto <- shapiro.test(residuals_auto)
print(shapiro_test_auto)
##
## Shapiro-Wilk normality test
##
## data: residuals_auto
## W = 0.774, p-value < 2.2e-16
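The Shapiro-Wilk and ACF checks can also be bundled into a single call: checkresiduals() from the forecast package plots the residuals, their ACF, and a histogram, and runs a Ljung-Box test for remaining autocorrelation (a sketch, reusing the two fitted models from above):

```r
library(forecast)

# One call per model: residual plot, ACF, histogram, and a Ljung-Box
# test. A large Ljung-Box p-value suggests the residuals behave like
# white noise (no leftover autocorrelation).
checkresiduals(manual_arima_model)
checkresiduals(auto_arima_model)
```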
# Step 4: Forecasting
# Forecast next 25 observations with Manual ARIMA
forecast_manual <- forecast(manual_arima_model, h = 25)
plot(forecast_manual, main = "Forecast from Manual ARIMA Model")
# Forecast next 25 observations with Automatic ARIMA
forecast_auto <- forecast(auto_arima_model, h = 25)
plot(forecast_auto, main = "Forecast from Automatic ARIMA Model")
Explanation of Steps
Fit ARIMA Models:
1. Manual ARIMA: adjust the (p, d, q) values based on your analysis of
the ACF and PACF plots. The example uses (1, 1, 1), but you may want to
experiment with different values.
2. Automatic ARIMA: the auto.arima() function automatically selects the
best ARIMA model based on AIC/BIC criteria.
Check Residuals:
1. Use the Shapiro-Wilk test to check the residuals for normality. A
p-value > 0.05 suggests that the residuals are consistent with a normal
distribution.
2. Plot the ACF and PACF of the residuals to identify any remaining
patterns. Ideally, the residuals should be white noise (no significant
autocorrelations).
Forecasting:
1. Use the forecast() function to generate forecasts for the next 25
observations. This gives both point forecasts and prediction intervals.
2. Plot the forecasts for visual analysis.
Comments on Forecasts
After running the code, you can compare the forecasts from both models
based on their accuracy, trend patterns, and any other insights gathered
from the forecast plots. This will help you decide which model is more
suitable for your data.
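One way to make that comparison concrete is a hold-out evaluation: refit each model on all but the last 25 days and score the forecasts against the held-out observations (a sketch; the 25-day horizon and the (1, 1, 1) order are the same assumptions used above):

```r
library(forecast)

h <- 25
n <- length(ts_data)

# Split off the last h observations as a test set, keeping ts attributes.
train <- window(ts_data, end = time(ts_data)[n - h])
test  <- window(ts_data, start = time(ts_data)[n - h + 1])

# Refit both models on the training portion only.
fit_manual <- arima(train, order = c(1, 1, 1))
fit_auto   <- auto.arima(train)

# accuracy() reports RMSE, MAE, MAPE, etc. on both the training set and
# the held-out test set; lower test-set errors indicate the better model.
accuracy(forecast(fit_manual, h = h), test)
accuracy(forecast(fit_auto, h = h), test)
```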
Throughout the process of this project, I gained valuable insights and developed a deeper understanding of the subject matter. The research journey involved thorough data analysis and interpretation, which allowed me to explore various facets of the topic. Here are some of the key components of my findings and conclusions:
Key Learnings
1. Data Handling:
I improved my skills in data manipulation and visualization. Using R and
its packages enabled me to clean, transform, and visualize data
effectively, reinforcing the importance of data quality in analysis.
2. Analytical Techniques:
I became familiar with different statistical methods and how to apply
them to derive meaningful insights from the data. This hands-on
experience solidified my understanding of concepts learned in
theory.
3. Critical Thinking:
The project encouraged me to think critically about the results and
their implications. I learned to question assumptions and consider
alternative explanations for the patterns observed in the data.
Results vs. Expectations
The results obtained were somewhat aligned with my expectations but also
presented some surprises. I anticipated certain trends based on
preliminary research, yet some findings contradicted my hypotheses. This
highlighted the value of data-driven insights over preconceived
notions.
Key Findings and Takeaways
1. Trends and Patterns:
The analysis revealed some expected trends, such as the impact of
temperature and season on total rentals, and some patterns that were not
immediately obvious, underscoring the importance of comprehensive data
analysis in uncovering hidden insights.
2. Implications for Practice:
The findings suggest practical applications in the relevant field, which
could inform future decision-making processes. This emphasizes the
relevance of empirical evidence in shaping policies and strategies.
3. Future Research Directions:
This work used only the day-level data; the research could be deepened
by also including the hourly data and by incorporating richer holiday
information. The project identified gaps in the current literature,
paving the way for further research. Future studies could build on these
findings to explore additional variables or refine the analysis methods
used.
In conclusion, this project not only enhanced my technical skills but also deepened my understanding of the subject matter. The experience reinforced the significance of a rigorous approach to research and the need to remain open to new insights that data may reveal.
MOHAMMAD ALI ASIF. Dated: Wed Sep 25 09:14:16 2024